fix: converter hf now handles byte characters. Closes #188 #189

Merged: 1 commit, Mar 22, 2025

antoine-sac
Contributor

The converter-tokenizer-hf is now aware of byte characters (such as "<0x0A>") and parses them correctly as the actual character (such as a newline, "\n").

This is useful for Mistral, TinyLlama, and other models that use the same tokenization method.
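
As an illustration of the idea (not the code merged in this PR), a minimal Python sketch of byte-token handling could look like the following. The helper name `token_to_bytes` and the regular expression are assumptions made for this example; they only cover the `<0xNN>` form mentioned above and do not handle other vocabulary conventions.

```python
import re

# Matches byte tokens of the form "<0xNN>", where NN is two hex digits.
# The exact pattern is an assumption for this sketch.
BYTE_TOKEN_RE = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: turn one vocabulary entry into raw bytes.

    A byte token such as "<0x0A>" becomes the single byte it names
    (here b"\n"); any other token is encoded as UTF-8 unchanged.
    """
    match = BYTE_TOKEN_RE.fullmatch(token)
    if match:
        # Parse the two hex digits and emit exactly one byte.
        return bytes([int(match.group(1), 16)])
    return token.encode("utf-8")


if __name__ == "__main__":
    for tok in ["<0x0A>", "<0x41>", "hello"]:
        print(repr(tok), "->", token_to_bytes(tok))
```

Returning `bytes` rather than `str` is deliberate in this sketch: a single byte token such as `<0xE2>` is not valid UTF-8 on its own, so it can only be decoded once it is joined with the bytes that follow it.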

See ggml-org/llama.cpp#4622 for more context.

Fix #188.

@b4rtaz b4rtaz merged commit ec2cb7f into b4rtaz:main Mar 22, 2025
3 checks passed
@b4rtaz
Owner

b4rtaz commented Mar 22, 2025

Thanks @antoine-sac!

Development

Successfully merging this pull request may close these issues.

Vocabulary containing special "byte tokens" not converted correctly
2 participants